Abstract

We present a new design pattern for high-performance parallel scientific
software, named coalesced communication. This pattern provides a
structured way to improve communication performance by coalescing
multiple communication needs using two communication management
components. We apply the design pattern to several simulations of a
lattice-Boltzmann blood flow solver with streaming visualisation, where it
reduces the communication overhead by approximately 40%.
1. Introduction

High-performance parallel scientific software often consists of complex,
multi-functional, multi-physics software components, run on infrastructures
which are increasingly large and frequently hybrid in nature (e.g.,
featuring many-core architectures or distributed systems). Orchestrating
the work of these components requires advanced software engineering and
design approaches to manage the attendant complexity. As a result, the
structure of high-performance computing (HPC) codes is moving towards the
use of higher-level design abstractions. One way to capture these design
abstractions is through the definition of design patterns. Design patterns
are commonly applied in software engineering [1]. They are formal
definitions which describe a specific solution to a design problem, and can
be found in a range of scientific and engineering disciplines. With HPC
codes growing in complexity, existing design patterns are increasingly
applied in HPC and numerous new design patterns have emerged [2, 3].

Here we present a new design pattern: coalesced communication. In this
pattern, each component registers the communication tasks it will require
during the different stages, or steps, of execution with a central registry.
We refer to each component which wishes to register communication requests
as a Client. This registry analyses the required communications and
combines requests from each Client at appropriate steps of the execution.
This allows the work of one Client (such as a scientific kernel) to overlap
with the communication of another Client (such as streaming visualisation
or error correction), and results in a single synchronisation point between
processes during each step.

1 E-mail: hywel.carver.09@ucl.ac.uk (Hywel B. Carver), p.v.coveney@ucl.ac.uk
(Peter V. Coveney)

Preprint submitted to Parallel Computing, October 17, 2012

Several groups have experimented with the coalescence of communication,
although none has developed it into a generalised design pattern. Bae et
al. [4] benchmark the coalescence of communication as a factor influencing
code complexity and efficiency within two algorithms. Bell et al. [5]
investigate the performance benefit of overlapping communication with
computation, which is an alternative method to reduce the number of
synchronisation points. Chavarria et al. [6] implement a form of
coalescence in a High-Performance Fortran compiler for situations where one
code location has multiple communication events, and find a reduction of up
to 55% in communication volume. Chen et al. [7] find a similar performance
improvement when applying coalescing in programs written in Unified
Parallel C, and Koop et al. [8] report significant improvements in
throughput when using low-level coalescence for sending small MPI messages.

2. Coalesced communication 

The coalesced communication pattern is applicable to any parallel software
which carries out multiple tasks, and therefore has a range of
communication needs. These communication needs may, for example, include
exchanges required for one or more scientific kernels, visualisation,
steering, dynamic domain decomposition, coupling with one or more external
programs, introspection or error recovery. Of course, each of these Clients
could do its own communication internally, but this can be highly
inefficient from a performance perspective due to the large number of
synchronisation points with other processes. The coalesced communication
pattern allows us to improve the communication performance by reducing the
number of synchronisation points in an organised way.

Within the coalesced communication pattern, each Client registers with an
administrative object called the StepManager, and all communication is
routed through a central store of communication requirements called the
CommunicationsManager object. The relations between these objects are shown
in Figure 1. In each of several Steps, a callback is made to each Client to
carry out those computations that are safe to perform during that step,
while the CommunicationsManager object makes the appropriate MPI calls to
initiate non-blocking message passing for each requested communication. In
this way, the communications of all Clients can be overlapped with their
calculation, potentially providing substantial performance gains. In
addition, bundling all the non-blocking communications reduces the number
of synchronisation points per step to one.
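The registration structure described above can be sketched in a few lines of C++. This is a minimal illustration only, not HemeLB's actual interface: the method names (RequestSend, RequestReceive, Register, RunStep, CompleteAll), the two example Clients and the message sizes are all assumptions made for this sketch, and the MPI calls are replaced by a simple counter so that the example runs without an MPI installation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// All names below are hypothetical; this is not HemeLB's actual interface.

// One communication requirement registered by a Client.
struct CommRequest {
    std::string owner;   // Client that asked for the transfer
    int peer;            // rank of the remote process
    std::size_t count;   // number of values to transfer
    bool isSend;         // true for a send, false for a receive
};

// Central store of communication requirements.
class CommunicationsManager {
public:
    void RequestSend(const std::string& owner, int peer, std::size_t count) {
        requests_.push_back({owner, peer, count, true});
    }
    void RequestReceive(const std::string& owner, int peer, std::size_t count) {
        requests_.push_back({owner, peer, count, false});
    }
    std::size_t PendingRequests() const { return requests_.size(); }

    // Stand-in for starting every non-blocking transfer and waiting once.
    // A real implementation would make MPI calls here; this only counts.
    void CompleteAll() {
        requests_.clear();
        ++syncPoints_;
    }
    int SyncPoints() const { return syncPoints_; }

private:
    std::vector<CommRequest> requests_;
    int syncPoints_ = 0;
};

// Interface each component implements to take part in the pattern.
class Client {
public:
    virtual ~Client() = default;
    virtual void RequestComms(CommunicationsManager& comms) = 0;
};

// Administrative object: Clients register once, then each step is driven here.
class StepManager {
public:
    explicit StepManager(CommunicationsManager& comms) : comms_(comms) {}
    void Register(Client& client) { clients_.push_back(&client); }

    // One step: gather the requests of all Clients, then synchronise once.
    // Returns how many requests were coalesced into that synchronisation.
    std::size_t RunStep() {
        for (Client* client : clients_) client->RequestComms(comms_);
        const std::size_t coalesced = comms_.PendingRequests();
        comms_.CompleteAll();
        return coalesced;
    }

private:
    CommunicationsManager& comms_;
    std::vector<Client*> clients_;
};

// Two illustrative Clients with made-up communication needs.
class KernelClient : public Client {
public:
    void RequestComms(CommunicationsManager& comms) override {
        comms.RequestSend("kernel", 1, 128);
        comms.RequestReceive("kernel", 1, 128);
    }
};

class VisualisationClient : public Client {
public:
    void RequestComms(CommunicationsManager& comms) override {
        comms.RequestSend("visualisation", 0, 16);
    }
};

// Usage: both Clients share a single synchronisation point in the step.
inline std::size_t DemoTwoClients() {
    CommunicationsManager comms;
    StepManager steps(comms);
    KernelClient kernel;
    VisualisationClient visualisation;
    steps.Register(kernel);
    steps.Register(visualisation);
    const std::size_t coalesced = steps.RunStep();
    assert(comms.SyncPoints() == 1);
    return coalesced;
}
```

In this sketch the three requests from the two Clients are completed behind one synchronisation point, whereas Clients communicating independently would each need their own.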

We present the sequence of events for an application with two Clients in
Figure 2. Here we see computation callbacks preceding and following each of
the MPI send, receive, and wait calls. For example, computation callbacks
are made to each Client after the CommunicationsManager makes the MPI send
calls, while it waits to receive the incoming data. The incoming data are
placed into buffers registered with the CommunicationsManager at the
beginning of each step, but the data are only safe to use following
completion of the Wait call made by the CommunicationsManager.
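The ordering of callbacks around the communication calls can be illustrated with a small, MPI-free C++ sketch that records the events of one step for two Clients. The phase names (PreSend, DuringTransfer, PostReceive), the exact order in which receives and sends are posted, and the log strings are assumptions made for this illustration; only the rule that received data may be used after Wait completes is taken from the text.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: phase names and ordering are assumptions.
using EventLog = std::vector<std::string>;

// Stand-in for the CommunicationsManager; logs instead of calling MPI.
class CommsMgr {
public:
    explicit CommsMgr(EventLog& log) : log_(log) {}
    void PostReceives() { log_.push_back("MPI receive posted"); inFlight_ = true; }
    void PostSends() { log_.push_back("MPI send posted"); }
    void Wait() { log_.push_back("MPI wait complete"); inFlight_ = false; }
    // Receive buffers may only be read once no transfer is in flight.
    bool BuffersSafe() const { return !inFlight_; }

private:
    EventLog& log_;
    bool inFlight_ = false;
};

class Client {
public:
    Client(std::string name, EventLog& log) : name_(std::move(name)), log_(log) {}
    void PreSend() { log_.push_back(name_ + ": compute before send"); }
    // Work that does not touch the receive buffers overlaps with the transfer.
    void DuringTransfer() { log_.push_back(name_ + ": compute during transfer"); }
    void PostReceive(const CommsMgr& comms) {
        assert(comms.BuffersSafe());  // data are only safe after Wait
        log_.push_back(name_ + ": compute on received data");
    }

private:
    std::string name_;
    EventLog& log_;
};

// Stand-in for the StepManager driving one step.
class StepMgr {
public:
    explicit StepMgr(CommsMgr& comms) : comms_(comms) {}
    void Register(Client& client) { clients_.push_back(&client); }
    void RunStep() {
        comms_.PostReceives();
        for (Client* c : clients_) c->PreSend();
        comms_.PostSends();
        for (Client* c : clients_) c->DuringTransfer();
        comms_.Wait();  // the single synchronisation point of the step
        for (Client* c : clients_) c->PostReceive(comms_);
    }

private:
    CommsMgr& comms_;
    std::vector<Client*> clients_;
};

// Usage: run one step with two Clients and return the recorded events.
inline EventLog RunTwoClientStep() {
    EventLog log;
    CommsMgr comms(log);
    Client kernel("kernel", log);
    Client visualisation("visualisation", log);
    StepMgr steps(comms);
    steps.Register(kernel);
    steps.Register(visualisation);
    steps.RunStep();
    return log;
}
```

The returned log lists the callbacks of both Clients on either side of each communication call, with only one wait in the whole step.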

3. Implementation 

We have implemented the coalesced communication design pattern within the
HemeLB lattice-Boltzmann simulation environment, which is intended to
accurately model cerebrovascular blood flow. HemeLB is written in C++ and
aims to provide timely and clinically relevant assistance to neurosurgeons
[9]. HemeLB contains a range of functionalities, including the core
lattice-Boltzmann kernel, visualisation modules and a steering component
which allows for interactive use of the application. HemeLB has been shown
to efficiently model sparse geometries using up to at least 32,768 compute
cores [10] and, inter alia, has been used for a variety of scenarios
[11, 9].

Figure 1: Entity relationship diagram of the coalesced communication design
pattern.

The primary Clients registered with the StepManager within HemeLB are the
core lattice-Boltzmann kernel, an in situ visualisation module and a module
for introspective monitoring. However, HemeLB will frequently run with
additional Clients as there are a number of optional modules, such as the
computational steering server. Within this article we focus on only the
core lattice-Boltzmann communications and the visualisation communications.

4. Performance Tests 

We have run HemeLB on 1024 cores on the HECToR Cray XE6 machine in
Edinburgh, United Kingdom, using a sparse cerebrovascular bifurcation
simulation domain which contains 19,808,107 fluid sites. Our simulations
run for 2000 steps with three different settings, rendering respectively
10, 100 and 200 images using the visualisation module. We repeated each run
both with and without coalesced communication enabled, using a compile-time
parameter to toggle this functionality. We measured the total time spent on
the simulation, on all communications, and on local operations required for
constructing the images.

Figure 2: Message sequence chart of the coalesced communication pattern,
generalised for an application with two Client components which require
communications. Function calls and data movements are indicated
respectively with solid and dashed arrows. The StepManager and
CommunicationsManager objects are abbreviated respectively as StepMgr and
CommsMgr. Time proceeds vertically downwards.

We present the results of our performance tests in Table 1. Based on our
measurements we find that the communication overhead in our coalesced runs
amounts to between 57% and 63% of the overhead in the non-coalesced runs.
When we render more images per time step, the absolute performance benefit
increases while the relative performance benefit decreases slightly.
However, the frame rate we obtain for the runs with 200 images generated is
already sufficient for real-time visual inspection of the data. The time
spent on visualisation is 0.0034 seconds per image, and scales linearly
with the number of images rendered.
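As a quick check of the figures quoted in this section, the relative overhead and the visualisation cost can be worked through as follows. The symbols r, T_comm and T_vis are introduced here for illustration only; the 57% to 63% range, the per-image time and the image counts are taken from the text, and the linear scaling is the one stated above.

```latex
\begin{align*}
r &= \frac{T_{\mathrm{comm}}^{\mathrm{coalesced}}}
          {T_{\mathrm{comm}}^{\mathrm{non\text{-}coalesced}}} \in [0.57,\ 0.63]
  \quad\Longrightarrow\quad
  1 - r \in [0.37,\ 0.43] \approx 40\% \\[4pt]
T_{\mathrm{vis}}(n) &= n \times 0.0034\ \mathrm{s}: \qquad
  T_{\mathrm{vis}}(10) = 0.034\ \mathrm{s}, \quad
  T_{\mathrm{vis}}(100) = 0.34\ \mathrm{s}, \quad
  T_{\mathrm{vis}}(200) = 0.68\ \mathrm{s}
\end{align*}
```

The first line shows how the measured range corresponds to a reduction in communication overhead of roughly 40%; the second gives the total visualisation time implied for each of the three settings.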

5. Discussion and conclusions 

Table 1: Performance results of our HemeLB simulations, run with and
without the coalesced communication strategy. Each simulation ran for 2000
time steps, using 1024 cores and modelling blood flow in a bifurcation
simulation domain. We ran our simulations rendering respectively 10 images
(first two rows), 100 images (middle two rows), and 200 images (last two
rows) at evenly spaced time intervals during execution.

We have presented the coalesced communication design pattern, which allows
the coalescence of the interprocess communications of multiple Client
components within complex parallel scientific software. We have
demonstrated the benefit of adopting the design pattern based on an
implementation in a blood flow application. Here the use of coalesced
communication reduces the total communication overhead of the simulations,
which have two primary Clients, by approximately 40%. This improvement
results in the application taking about 7% less time overall, making it
more responsive when applied for clinical or scientific purposes. The
design pattern can be directly applied in other parallel scientific
software projects, providing a structured way to improve communication
performance through coalescence.